Abstract

One challenge of large-scale data analysis is that the assumption of an identical distribution for all samples is often not realistic. An optimal linear regression might, for example, be markedly different for distinct groups of the data. Maximin effects have been proposed as a computationally attractive way to estimate effects that are common across all data without fitting a mixture distribution explicitly. So far, only point estimators of the common maximin effects have been proposed (Meinshausen and Bühlmann, 2014). Here we propose asymptotically valid confidence regions for these effects.


1 Introduction 

Large-scale regression analysis often has to deal with inhomogeneous data in the sense that samples are not drawn independently from the same distribution. The optimal regression coefficient might, for example, be markedly different in distinct groups of the data or vary slowly over a chronological ordering of the samples. One option is then either to model the exact variation of the regression vector with a varying-coefficient model in the latter case (Hastie and Tibshirani, 1993; Fan and Zhang, 1999) or to fit a mixture distribution (Aitkin and Rubin, 1985; McLachlan and Peel, 2004; Figueiredo and Jain, 2002). For large-scale analysis with many groups of data samples or many predictor variables this approach might be too expensive computationally and also yield more information than necessary in settings where one is just interested in effects that are present in all sub-groups of data. A maximin effect was defined in Meinshausen and Bühlmann (2014) as the effect that is common to all sub-groups of data, and a simple estimator based on subsampling of the data was proposed in Bühlmann and Meinshausen (2014). However, the estimators for maximin effects proposed so far yield only point estimates, whereas we are interested here in confidence intervals. While we are mostly dealing with low-dimensional data where the sample size exceeds the number of predictor variables, the results could potentially be extended to high-dimensional regression using similar ideas as proposed, for example, in Zhang and Zhang (2014) or Van de Geer et al. (2014) for the estimation of optimal linear regression effects for high-dimensional data.


1.1 Model and notation 


We first present a model for inhomogeneous data as considered in Meinshausen and Bühlmann (2014). Specifically, we look at a special case where the data are split into several known groups $g = 1,\dots,G$. In each group $g$, we assume a linear model of the form

\[ Y_g = X_g b_g^0 + \varepsilon_g, \tag{1} \]


where $Y_g$ is an $n$-dimensional response vector of interest, $b_g^0$ a deterministic $p$-dimensional regression parameter vector and $X_g$ an $n \times p$-dimensional design matrix containing in the columns the $n$ observations of $p$ predictor variables. The noise contributions $\varepsilon_g$ are assumed to be independent with distribution $\mathcal{N}_n(0, \sigma^2 \mathrm{Id}_n)$. We assume the sample size $n$ to be identical in each group. Generalizations to varying-coefficient models (Hastie and Tibshirani, 1993; Fan and Zhang, 1999) are clearly possible but notationally more cumbersome. Inhomogeneity is caused by the different parameter vectors in the groups. We define $X$ as the row-wise concatenation of the design matrices $X_1, X_2, \dots, X_G$ and assume that the groups are known, that is, we know which observations belong to the groups $g = 1,\dots,G$, respectively. For the distribution of $X_g$, $g = 1,\dots,G$, we consider different scenarios.


Scenario 1. Random design. The observations of the predictor variables are independent 
samples of an unknown multivariate distribution F with finite fourth moments. We assume this 
distribution to be common across all groups g = 1,... ,G. 


Scenario 2. Random design in each group. The observations in each group are independent samples of an unknown distribution $F_g$ with finite fourth moments. Observations in different groups are independent. The distribution $F_g$ may be different in different groups.

In the following if not mentioned otherwise we assume Scenario 1. The generalization to 
Scenario 2 is to a large extent only notational. 


1.2 Aggregation 


The question arises how the inhomogeneity of the optimal regression across groups is taken into account when trying to estimate the relationship between the predictor variables and the outcome of interest. Several known alternatives such as mixed effects models (Pinheiro and Bates, 2000), mixture models (McLachlan and Peel, 2004) and clusterwise regression models (DeSarbo and Cron, 1988) are possibilities and are useful especially in cases where the group structure is unknown. They are at the same time computationally quite demanding.

A computationally attractive alternative (especially for the discussed case of known groups but also more generally) is to estimate the optimal regression coefficient separately in each group, where the groups are either known (as assumed in the following) or sampled in some appropriate form (Meinshausen and Bühlmann, 2014). As estimates for the $b_g^0$ we use in the following standard least squares estimators

\[ \hat b_g = \arg\min_{b \in \mathbb{R}^p} \|Y_g - X_g b\|_2^2. \]

The restriction to this estimator is only for the purpose of simplicity. Regularization can be 
added if necessary but the essential issues are already visible for least-squares estimation. 
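The group-wise least-squares step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' implementation: the function name `groupwise_least_squares` and the toy data (two groups, 200 observations each, three predictors) are hypothetical.

```python
import numpy as np

def groupwise_least_squares(X_groups, Y_groups):
    """Return a (p, G) matrix whose g-th column is the OLS estimate in group g."""
    cols = []
    for X_g, Y_g in zip(X_groups, Y_groups):
        # lstsq minimises ||Y_g - X_g b||_2^2 over b
        b_g, *_ = np.linalg.lstsq(X_g, Y_g, rcond=None)
        cols.append(b_g)
    return np.column_stack(cols)

# Hypothetical toy data: G = 2 groups, n = 200 per group, p = 3 predictors.
rng = np.random.default_rng(0)
B0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # one true coefficient vector per column
X_groups = [rng.standard_normal((200, 3)) for _ in range(2)]
Y_groups = [X @ B0[:, g] + rng.standard_normal(200) for g, X in enumerate(X_groups)]
B_hat = groupwise_least_squares(X_groups, Y_groups)   # columns estimate the columns of B0
```

Because each group is processed independently, the loop could be distributed across workers before any aggregation step.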

Now a least-squares estimator is obtained in each group of data and the question is how these different estimators can be aggregated. The simplest and perhaps most widely used aggregation scheme is bagging (bootstrap aggregation), as proposed by Breiman (1996), where the aggregated estimator is given by

\[ \text{Bagging:} \quad \hat b := \sum_{g=1}^G w_g \hat b_g, \quad \text{where } w_g = \frac{1}{G} \ \ \forall g = 1,\dots,G. \tag{2} \]

If the data from different groups originate from an independent sampling mechanism, bagging is a useful aggregation scheme. In particular, computing the bagged estimator is computationally more attractive than computing a single least-squares estimator, as it allows the data to be split up into distinct subsets and processed independently before the aggregation step. For inhomogeneous data, the variability of the estimates $\hat b_g$ for $g = 1,\dots,G$ allows one to gain some insight into the nature of the inhomogeneity. However, as argued in Bühlmann and Meinshausen (2014), averaging is the wrong aggregation mechanism for inhomogeneous data.


1.3 Maximin effect and magging 


For inhomogeneous data, instead of looking for an estimator that works best on average, Meinshausen and Bühlmann (2014) proposed to aim to maximize the minimum explained variance across several settings $g = 1,\dots,G$. To be more precise, in our setting,

\[ b_{\text{maximin}} := \arg\max_{b \in \mathbb{R}^p} \ \min_{g=1,\dots,G} V(b, b_g^0), \]

where $V(b, b_g^0)$ is the explained variance in group $g$ (with true regression vector $b_g^0$) when using a regression vector $b$. That is,

\[ V(b, b_g^0) := E\|Y_g\|_2^2 - E\|Y_g - X_g b\|_2^2 = 2 b^t \Sigma^0 b_g^0 - b^t \Sigma^0 b, \]

where $\Sigma^0 := E\hat\Sigma$ and $\hat\Sigma := (nG)^{-1} X^t X$ is the sample covariance matrix. In words, the maximin effect is defined as the regression vector that maximises the explained variance in the most adversarial scenario ("group"). In this sense, the maximin effect is the effect that is common among all groups in the data and ignores the effects that are present in some groups but not in others. It was shown in Meinshausen and Bühlmann (2014) that the definition above is equivalent to

\[ b_{\text{maximin}} = \arg\min_{b \in \mathrm{CVX}(B^0)} b^t \Sigma^0 b, \]

where $B^0 = (b_1^0, \dots, b_G^0) \in \mathbb{R}^{p \times G}$ is the matrix of the regression parameter vectors and $\mathrm{CVX}(B^0)$ denotes the closed convex hull of the $G$ vectors in $B^0$. The latter definition motivates maximin aggregating, or magging (Bühlmann and Meinshausen, 2014), which is the convex combination that minimizes the $\ell_2$-norm of the fitted values:


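\[ \text{Magging:} \quad \hat b := \sum_{g=1}^G \hat\alpha_g \hat b_g, \quad \text{where } \hat\alpha := \arg\min_{\alpha \in C_G} \Big\| \sum_{g=1}^G \alpha_g X \hat b_g \Big\|_2 \ \text{ and } \ C_G := \Big\{ \alpha \in \mathbb{R}^G : \min_g \alpha_g \ge 0 \text{ and } \sum_g \alpha_g = 1 \Big\}. \]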

The magging regression vector is unique if $X^t X$ is positive definite. Otherwise, we can only
identify the prediction effect $X b_{\text{maximin}}$ and the solution above is meant to be any member of
the feasible set of solutions. To compute the estimator, the dataset is split into several smaller 
datasets and we assume here that the split separates the data into already known groups. After 
computing estimators on all of these groups separately, possibly in parallel, magging can be used 
to find common effects of all datasets. This is in particular interesting if there is inhomogeneity 
in the data. For known groups, as in our setting, magging can be interpreted as the plug-in 
estimate of the maximin effect. 
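The magging weights solve a small quadratic program over the probability simplex. The sketch below is one possible way to compute them and is an assumption of this illustration rather than the method used by the authors: it runs a Frank-Wolfe iteration with exact line search on the quadratic form $\alpha^t \hat B^t \hat\Sigma \hat B \alpha$, which is proportional to the squared norm of the fitted values. The function name `magging` and the iteration count are hypothetical; a dedicated QP solver would typically converge faster.

```python
import numpy as np

def magging(B_hat, Sigma_hat, n_iter=2000):
    """Minimise a^T (B^T Sigma B) a over the probability simplex; return (B a, a)."""
    H = B_hat.T @ Sigma_hat @ B_hat           # G x G quadratic form
    G = H.shape[0]
    a = np.full(G, 1.0 / G)                   # start at the barycentre of the simplex
    for _ in range(n_iter):
        s = np.zeros(G)
        s[np.argmin(H @ a)] = 1.0             # vertex minimising the linearised objective
        d = s - a
        curv = d @ H @ d
        if curv <= 1e-15:                     # no direction left to move in
            break
        step = np.clip(-(a @ H @ d) / curv, 0.0, 1.0)   # exact line search on [0, 1]
        if step == 0.0:                       # first-order optimality reached
            break
        a = a + step * d
    return B_hat @ a, a

# Two orthogonal group effects: the common effect is their midpoint.
M1, a1 = magging(np.eye(2), np.eye(2))
# One group effect dominates the other: magging picks the smaller one.
M2, a2 = magging(np.array([[1.0, 2.0], [1.0, 1.0]]), np.eye(2))
```

In the second example the columns are $(1,1)$ and $(2,1)$; every convex combination other than the first column has a larger norm, so all weight goes to the first group.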

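In the following we need additional notation. For $B := (b_1, \dots, b_G) \in \mathbb{R}^{p \times G}$ and for $\Sigma \in \mathbb{R}^{p \times p}$ positive definite define

\[ M_\Sigma(B) := \arg\min_{b \in \mathrm{CVX}(B)} b^t \Sigma b. \]

We obtain the original definition of the magging estimator for $M_{\hat\Sigma}(\hat B)$ with $\hat B = (\hat b_1, \dots, \hat b_G)$ and the maximin effect with $M_{\Sigma^0}(B^0)$.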
1.4 Novel contribution and organization of the paper

So far only point estimators of maximin effects have been proposed in the literature. In Section 2 we discuss an asymptotic approach to construct confidence regions for the maximin effect. Specifically, we calculate the asymptotic distribution of $\sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$ and derive corresponding asymptotically valid confidence regions. This gives us (asymptotically) tight confidence regions and will shed more light on the (asymptotic) nature of the fluctuations of the magging estimator. We evaluate the actual coverage of this approximation on simulated datasets in Section 3. The proofs of the corresponding theorems and an alternative non-asymptotic approach can be found in the appendix. The advantages and disadvantages of the approaches are discussed in Section 4.


2 Confidence intervals for maximin effects 


In Scenario 1, the random design of the predictor variables is identical across all groups of data. For fixed $G$ and $n \to \infty$, we can then use the delta method to derive the asymptotic distribution of the scaled difference between the estimated and true maximin effects,

\[ \sqrt{n}\,\big(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0)\big). \]

This in turn allows us to construct confidence intervals for the true maximin effects. Let $\hat W(\hat B, \hat\Sigma)$ be a consistent estimator of the (positive definite) variance of the Gaussian limit distribution of $\sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$ as $n \to \infty$. Let $\alpha > 0$. Choose $\tau$ as the $(1-\alpha)$-quantile of the $\chi^2_p$-distribution. Define then a confidence region as

\[ C(\hat\Sigma, \hat B) := \Big\{ M \in \mathbb{R}^p : \big(M_{\hat\Sigma}(\hat B) - M\big)^t\, \hat W(\hat B, \hat\Sigma)^{-1} \big(M_{\hat\Sigma}(\hat B) - M\big) \le \frac{\tau}{n} \Big\}. \tag{3} \]

The definition of $\hat W(\hat B, \hat\Sigma)$ is deferred to the appendix, Section 5.1. We will show in the following that we obtain asymptotically valid confidence intervals with this approach. For simplicity, we work with Scenario 1 here and assume that the noise contributions $\varepsilon_g$ in equation (1) are independent with distribution $\mathcal{N}_n(0, \sigma^2 \mathrm{Id}_n)$. Furthermore, each design matrix $X_g \in \mathbb{R}^{n \times p}$ is assumed to have full rank, requiring $p \le n$. Though the framework for the result is a Gaussian linear model, it can be easily extended to more general settings.
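Checking whether a candidate vector lies in such a region amounts to evaluating one quadratic form. The following sketch assumes the region is the ellipsoid $n\,(\hat M - M)^t \hat W^{-1} (\hat M - M) \le \tau$, as suggested by the $\sqrt{n}$-scaling of the limit distribution; the function name `in_confidence_region` and all numerical inputs are hypothetical. For $p = 2$ the $\chi^2_2$ quantile has the closed form $-2\log\alpha$, which avoids any dependence on a statistics library; for other $p$ a quantile function would be needed.

```python
import math
import numpy as np

def in_confidence_region(M, M_hat, W_hat, n, tau):
    """True if n * (M_hat - M)^T W_hat^{-1} (M_hat - M) <= tau."""
    diff = np.asarray(M_hat, dtype=float) - np.asarray(M, dtype=float)
    stat = n * diff @ np.linalg.solve(W_hat, diff)   # avoids forming the explicit inverse
    return bool(stat <= tau)

alpha = 0.05
tau = -2.0 * math.log(alpha)      # (1 - alpha)-quantile of chi^2 with p = 2 degrees of freedom

M_hat = np.array([0.5, 0.5])      # hypothetical magging estimate
W_hat = np.eye(2)                 # hypothetical variance estimate
near = in_confidence_region([0.5, 0.6], M_hat, W_hat, n=100, tau=tau)
far = in_confidence_region([0.5, 1.0], M_hat, W_hat, n=100, tau=tau)
```

With these inputs the first candidate gives a statistic of about 1 and the second a statistic of 25, to be compared with a threshold of about 5.99.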

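The following theorem describes the coverage properties of the confidence region (3). In the following, for $x, y \in \mathbb{R}^p$ and $\Sigma \in \mathbb{R}^{p \times p}$ positive definite define $\langle x, y \rangle_\Sigma := x^t \Sigma y$.

Theorem 1. Let $\Sigma^0$ be positive definite. Let $M_{\Sigma^0}(B^0) = \sum_{g=1}^G \alpha_g b_g^0$ with $\alpha_g \ge 0$, $\sum_{g=1}^G \alpha_g = 1$, and let this representation be unique. Let $|\{g : \alpha_g \ne 0\}| > 1$. Suppose that the hyperplane orthogonal to the maximin effect contains only "active" $b_g^0$, i.e.

\[ \{ b_g^0 : g = 1,\dots,G \} \cap \{ M \in \mathbb{R}^p : \langle M - M_{\Sigma^0}(B^0), M_{\Sigma^0}(B^0) \rangle_{\Sigma^0} = 0 \} \subset \{ b_g^0 : \alpha_g \ne 0 \}. \]

Then

\[ \lim_{n \to \infty} P\big[ M_{\Sigma^0}(B^0) \in C(\hat\Sigma, \hat B) \big] = 1 - \alpha. \]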


In other words, the set defined in (3) is an asymptotically valid confidence region for $M_{\Sigma^0}(B^0)$ under the stated assumptions. If the true coefficients $b_g^0$ in each group are drawn from a multivariate density, then the assumptions are fulfilled with probability one.

The special case $|\{g : \alpha_g \ne 0\}| = 1$ is excluded, as the magging estimator is identical to a solution in one individual group in this case, which is equivalent to $M_{\hat\Sigma}(\hat B) = \hat b_g$ for a $g \in \{1,\dots,G\}$, up to an asymptotically negligible set. This case is mainly excluded for notational reasons. The assumptions of Theorem 1 guarantee that the derivative of magging $M_\Sigma(B)$ exists and is continuous at $B^0$ and $\Sigma^0$. If the latter condition is violated, it is still possible to obtain asymptotic bounds in the more general setting, as $\lim_{n\to\infty} \sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$ is still subgaussian. We explore the violation of these assumptions with simulation studies in the next section. The proof of Theorem 1 is an application of Slutsky's Theorem, combined with the following result about the asymptotic variance of the magging estimator.

Figure 1: An illustration of Theorems 1 and 2. On the left-hand side the blue dots represent 3000 realizations of $\hat b_g$, $g = 1,2,3$, with dimension $p = 3$. The black dots are the corresponding magging estimates $M_{\hat\Sigma}(\hat B)$. The green line indicates the true maximin effect $M_{\Sigma^0}(B^0)$. On the right-hand side, the black line indicates one of the $M_{\hat\Sigma}(\hat B)$ with the corresponding approximate 95%-confidence region calculated as in Section 2.

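Theorem 2. Let the assumptions of Theorem 1 be true. Then, for $n \to \infty$,

\[ \sqrt{n}\,\big(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0)\big) \to \mathcal{N}\Big(0,\ \sigma^2 \sum_{g \in A(B^0,\Sigma^0)} D_g M_{\Sigma^0}(B^0)\, (\Sigma^0)^{-1}\, D_g M_{\Sigma^0}(B^0)^t + V(B^0, \Sigma^0) \Big). \tag{4} \]

Here $D_g$ denotes the differential in direction $b_g$. This derivative is calculated in the appendix, see Section 5.1. The set $A(B,\Sigma) \subset \{1,\dots,G\}$ denotes the indices $g$ for which $b_g$ has a nonvanishing coefficient $\alpha_g$ in one of the convex combinations $M_\Sigma(B) = \sum_{g=1}^G \alpha_g b_g$, $\alpha_g \ge 0$, $\sum_{g=1}^G \alpha_g = 1$. Note that by the assumptions of Theorem 1 this convex combination is unique for $M_{\Sigma^0}(B^0)$. The definition of $V$ is somewhat lengthy and can be found in the appendix, Section 5.1.

The first summand in the variance in formula (4) is due to fluctuations of the estimator of $B^0$, the second summand is due to fluctuations of the estimator of $\Sigma^0$. If $\Sigma^0$ is known in advance, we can use $\hat\Sigma := \Sigma^0$ and in the theorem above $V = 0$. Figure 1 is an illustration of Theorem 2.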


3 Numerical Examples 

The aim of this section is to evaluate the actual coverage of the approximate confidence regions as defined above. We study several examples. They have in common that the entries in $X$ are i.i.d. $\mathcal{N}(0,1)$. Furthermore, the $\varepsilon_g$ are i.i.d. $\mathcal{N}(0, \mathrm{Id}_n)$ and independent of $X$. The tables show the coverage of the true maximin effect $M_{\Sigma^0}(B^0)$ by the proposed 95% confidence regions. We calculate the confidence intervals only for $p < n$ scenarios as long as least squares estimators are used (Tables 1-3), while the case of $p \ge n$ is covered in Tables 4 and 5 by the use of a ridge penalty. All simulations were run 1000 times.

In the setting of Table 1 all assumptions of Theorem 1 are satisfied. As expected, for large $p$ the convergence of the actual coverage seems to be slower. Note that for validity of Theorem 1 it is not necessary that $p = G$, as we have asymptotically tight coverage for all $1 < G \le p$.
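To make the simulation design concrete, the sketch below generates one dataset in the setting of Table 1 ($b_g^0 = e_g$, standard normal design and noise) and compares the magging point estimate with the true maximin effect, which equals $(1/p,\dots,1/p)$ when $\Sigma^0 = \mathrm{Id}$ and $B^0 = \mathrm{Id}$. It does not compute the confidence region or its coverage. The Frank-Wolfe solver, the seed and the choice $p = 3$, $n = 2000$ are assumptions of this illustration, not taken from the paper.

```python
import numpy as np

def magging(B_hat, Sigma_hat, n_iter=5000):
    """Frank-Wolfe with exact line search for min_a a^T (B^T Sigma B) a on the simplex."""
    H = B_hat.T @ Sigma_hat @ B_hat
    a = np.full(H.shape[0], 1.0 / H.shape[0])
    for _ in range(n_iter):
        s = np.zeros_like(a)
        s[np.argmin(H @ a)] = 1.0                 # best simplex vertex for the linearisation
        d = s - a
        curv = d @ H @ d
        if curv <= 1e-15:
            break
        a = a + np.clip(-(a @ H @ d) / curv, 0.0, 1.0) * d
    return B_hat @ a

rng = np.random.default_rng(1)
p = G = 3
n = 2000
B0 = np.eye(p)                                    # b_g^0 = e_g
X_groups = [rng.standard_normal((n, p)) for _ in range(G)]
Y_groups = [X @ B0[:, g] + rng.standard_normal(n) for g, X in enumerate(X_groups)]

# Group-wise least squares, then the pooled sample covariance (nG)^{-1} X^T X.
B_hat = np.column_stack([np.linalg.lstsq(X, Y, rcond=None)[0]
                         for X, Y in zip(X_groups, Y_groups)])
X_all = np.vstack(X_groups)
Sigma_hat = X_all.T @ X_all / (n * G)

M_hat = magging(B_hat, Sigma_hat)
M_true = np.full(p, 1.0 / p)                      # maximin effect for Sigma^0 = Id, B^0 = Id
max_error = float(np.max(np.abs(M_hat - M_true)))
```

Repeating this over many seeds and checking membership of `M_true` in the estimated region would give an empirical coverage figure of the kind reported in the tables.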



          n = 5     10     15    100    200    500   1000   2000   4000
p = 3      0.70   0.78   0.82   0.92   0.94   0.95   0.94   0.94   0.95
p = 5         -   0.69   0.76   0.90   0.93   0.95   0.94   0.95   0.95
p = 10        -      -   0.62   0.84   0.88   0.94   0.95   0.96   0.94
p = 15        -      -      -   0.78   0.85   0.93   0.92   0.95   0.95
p = 20        -      -      -   0.72   0.83   0.90   0.91   0.95   0.94
p = 40        -      -      -   0.54   0.63   0.79   0.88   0.91   0.94
p = 80        -      -      -   0.57   0.38   0.50   0.74   0.85   0.92

Table 1: $b_g^0 = e_g$, $g = 1,\dots,G = p$, where the $e_g$ denote the vectors of the standard basis, 1000 iterations. The coverage can be seen to be approximately correct if $n$ is sufficiently large.



          n = 5     10     15    100    200    500   1000   2000   4000
p = 3      0.64   0.84   0.91   0.97   0.96   0.82   0.98   0.96   0.97
p = 5         -   0.61   0.79   0.99   0.97   0.88   0.82   0.91   1.00
p = 10        -      -   0.23   0.99   0.99   1.00   0.99   0.93   0.98
p = 15        -      -      -   0.99   0.99   1.00   1.00   0.99   0.99
p = 20        -      -      -   0.99   1.00   0.99   1.00   1.00   0.99
p = 40        -      -      -   0.94   1.00   1.00   1.00   1.00   1.00
p = 80        -      -      -   0.00   1.00   1.00   1.00   1.00   1.00

Table 2: $b_g^0 = e_1 + Z_g e_2$, $g = 1,\dots,G = p$, $Z_g \sim \mathcal{N}(0,1)$ independent. The assumptions are violated, yielding too conservative confidence intervals. The 0.00 at $n = 100$, $p = 80$ is due to a large bias of $M_{\hat\Sigma}(\hat B)$ towards 0. For larger $n$, however, this bias quickly vanishes and we get the desired coverage (starting at approximately $n = 120$).



          n = 5     10     15    100    200    500   1000   2000   4000
p = 3      0.76   0.87   0.90   0.99   0.99   0.99   1.00   1.00   1.00
p = 5         -   0.65   0.78   1.00   1.00   1.00   1.00   1.00   1.00
p = 10        -      -   0.33   1.00   1.00   1.00   1.00   1.00   1.00
p = 15        -      -      -   0.99   1.00   1.00   1.00   1.00   1.00
p = 20        -      -      -   0.99   1.00   1.00   1.00   1.00   1.00
p = 40        -      -      -   0.93   1.00   1.00   1.00   1.00   1.00
p = 80        -      -      -   0.00   1.00   1.00   1.00   1.00   1.00

Table 3: $b_g^0 = e_1$, $g = 1,\dots,G = [0.8p]$. The assumptions are again violated and coverage is too high. At $p = 80$ and $n = 100$ we observe the same effect as in Table 2. In this scenario the estimated confidence regions can become arbitrarily large. This stems from the fact that if some of the $\hat b_g$ corresponding to $A(\hat B, \hat\Sigma)$ are very close, the estimated variance of magging may become large. In this setting a different approach, for example as discussed in Section 5.4, makes more sense.
          n = 5     10     15    100    200    500   1000   2000   4000
p = 3      0.71   0.77   0.84   0.92   0.94   0.96   0.95   0.94   0.93
p = 5      0.74   0.69   0.76   0.90   0.94   0.94   0.95   0.95   0.95
p = 10     0.55   0.70   0.60   0.86   0.88   0.93   0.94   0.94   0.95
p = 15     0.52   0.53   0.70   0.77   0.86   0.91   0.94   0.95   0.95
p = 20     0.53   0.48   0.52   0.73   0.81   0.89   0.93   0.92   0.94
p = 40     0.40   0.47   0.37   0.52   0.62   0.81   0.87   0.90   0.94
p = 80     0.20   0.40   0.37   0.56   0.38   0.52   0.72   0.84   0.90

Table 4: $b_g^0 = e_g$, $g = 1,\dots,G = p$. The diagonal elements of the estimated covariance matrices were increased by a small constant (a negative power of 10) in order to make them invertible and not too ill-conditioned for $n < p$. Again, coverage is approximately correct for $n$ sufficiently large.



          n = 5       10       15      100    200    500   1000   2000   4000
p = 3     41.70     2.97     1.59     0.59   0.53   0.49   0.47   0.47   0.46
p = 5    831.50    13.52     4.83     0.42   0.34   0.30   0.28   0.26   0.26
p = 10     6.56  1935.77    27.78     0.29   0.20   0.16   0.14   0.13   0.12
p = 15     0.29    19.83  3844.87     0.26   0.16   0.12   0.10   0.09   0.08
p = 20     0.08     4.25    41.04     0.29   0.15   0.09   0.08   0.07   0.06
p = 40     0.01     0.04     4.61     2.71   0.16   0.07   0.05   0.04   0.03
p = 80     0.00     0.00     0.01   205.85   1.09   0.06   0.03   0.02   0.02

Table 5: This table shows the average maximum eigenvalues of the estimated covariance matrix of $\sqrt{n}(M_{\Sigma^0}(B^0) - M_{\hat\Sigma}(\hat B))$, analogous to Table 4.


In Table 2 and Table 3 we explore the violation of one of the assumptions in Theorem 1. The maximin effect is $M_{\Sigma^0}(B^0) = (1,0,0,\dots)$, and the convex combination $M_{\Sigma^0}(B^0) = \sum_{g=1}^G \alpha_g b_g^0$ with $\alpha_g \ge 0$, $\sum_{g=1}^G \alpha_g = 1$ is not unique. In both cases, this seems to lead to too conservative confidence regions. Generally, in these settings the difficulty arises from the fact that the derivative of $M_\Sigma(B)$ does not exist at $B^0$. As a result, the fluctuations of $\lim_n \sqrt{n}(M_{\Sigma^0}(B^0) - M_{\hat\Sigma}(\hat B))$, provided that this limit exists, are not necessarily Gaussian anymore.

In the last simulation, depicted in Table 4, the $\hat b_g$, $g = 1,\dots,G$, were not calculated by ordinary least squares but by ridge regression. The diagonal elements of the estimated covariance matrices were increased by a small constant (a negative power of 10) in order to make them invertible and not too ill-conditioned for $n < p$. Apart from that we used the same setting as in Table 1. As in Table 1, for large $n$ the coverage seems to be (approximately) correct, but severe undercoverage can still occur when $n$ is not large relative to $p$. In these high-dimensional settings, the ridge tuning parameter would need to be better adjusted for a useful balance between bias and variance, and the bias of the ridge penalty would have to be adjusted for, something which is beyond the current scope. Table 5 reports the corresponding maximum eigenvalues of the estimated variance of $\sqrt{n}(M_{\Sigma^0}(B^0) - M_{\hat\Sigma}(\hat B))$, each entry being the average over all 1000 runs. We observe a spike for $p = n$. This peaking is similar to a related effect in ridge and lasso regression. Specifically, for fixed $p$ and varying $n$, the norm of the regression estimate grows as $n$ is increased, reaches its peak at approximately $p = n$ and then decreases again as the solution converges towards the true parameter as $n$ grows very large.
4 Discussion


We derived the asymptotic distribution of the magging estimator and proposed asymptotically tight and valid confidence regions for the maximin effect. The corresponding theorems require a rather weak assumption on the true regression coefficients $b_1^0, \dots, b_G^0$. However, if this assumption is not satisfied, as studied in simulations, the resulting confidence regions seem to become too conservative. Especially when all of the "active" vectors $\{\hat b_g : g \in A(\hat B, \hat\Sigma)\}$ are very close to each other, the proposed confidence regions tend to become large. Furthermore, in this scenario the magging estimator may suffer from a large bias. Then it may make more sense to use an approach based on relaxation. Such an approach is outlined in the appendix in Section 5.4, and it would also allow for non-asymptotic confidence intervals at the price of coverage probabilities well above the specified level. The proposed asymptotic confidence interval, on the other hand, is arguably more intuitive and yields in most scenarios tight bounds for large sample sizes.
5 Appendix


The structure is as follows: The first part is devoted to the most important definitions and explicit formulas which were omitted in the main section of the paper. The second part contains the proof of Theorem 2 and several lemmata. The third part contains the proof of Theorem 1. Finally, the last part contains a relaxation-based idea to construct confidence intervals for maximin effects.


5.1 Definitions and formulas 
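Definition 1. $A(B,\Sigma)$

The set $A(B,\Sigma) \subset \{1,\dots,G\}$ denotes the indices $g$ for which $b_g$ has a nonvanishing coefficient $\alpha_g$ in one of the convex combinations $M_\Sigma(B) = \sum_{g=1}^G \alpha_g b_g$, $\alpha_g \ge 0$, $\sum_{g=1}^G \alpha_g = 1$. Note that by the assumptions of Theorem 1 or Theorem 2 the $\alpha_g$ are unique for $M_{\Sigma^0}(B^0)$.

Definition 2. $\hat W(\hat B,\hat\Sigma)$

$\hat W(\hat B, \hat\Sigma)$ is a consistent estimator of the variance of $\lim_n \sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$, see the proofs of the theorems.

\[ \hat W(\hat B, \hat\Sigma) = \hat\sigma^2 \sum_{g \in A(\hat B,\hat\Sigma)} D_g M_{\hat\Sigma}\big(\hat B_{A(\hat B,\hat\Sigma)}\big)\, \hat\Sigma^{-1}\, D_g M_{\hat\Sigma}\big(\hat B_{A(\hat B,\hat\Sigma)}\big)^t + \hat V(\hat B, \hat\Sigma) \]

Definitions and explicit formulas of these terms can be found below. We estimate $\Sigma^0$ by $\hat\Sigma = (nG)^{-1} X^t X$. $D_g M_\Sigma(B)$ denotes the derivative of $M_\Sigma(B)$ with respect to $b_g$.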


Explicit formula for $\hat V(\hat B, \hat\Sigma)$. (Compare with Lemma 5)

Consistent estimator of the additional variance of $\lim_n \sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$ "caused" by not knowing $\Sigma^0$, see the proof of Theorem 2 and Lemma 5.


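where $C$ is the empirical covariance matrix of the $p$-dimensional vectors $X_{k\cdot}^t X_{k\cdot}\, M_{\hat\Sigma}(\hat B)$, $k = 1,\dots,nG$, with $X_{k\cdot}$ the $k$-th row of $X$. Furthermore, with $B = \hat B_{A(\hat B,\hat\Sigma)}$, $G' = |A(\hat B, \hat\Sigma)|$:

\[ b := (b_2, \dots, b_{G'}) - (b_1, \dots, b_1). \]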


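Explicit formula for $D_g M_{\hat\Sigma}(B)$. (Compare with Lemma 1)

Let us again write $B = \hat B_{A(\hat B,\hat\Sigma)}$, $G' = |A(\hat B, \hat\Sigma)|$,

\[ D_g M_{\hat\Sigma}(B) = -\frac{(\mathrm{Id} - PA^{(g)})\, b_g \; M_{\hat\Sigma}(B)^t\, \hat\Sigma}{\|(\mathrm{Id} - PA^{(g)})\, b_g\|_{\hat\Sigma}^2} + \frac{\|(\mathrm{Id} - PA^{(g)})\, M_{\hat\Sigma}(B)\|_{\hat\Sigma}}{\|(\mathrm{Id} - PA^{(g)})\, b_g\|_{\hat\Sigma}}\, \Pi_B. \]

Here, $PA^{(g)}$ denotes the affine projection on the smallest affine space containing $b_1, \dots, b_{g-1}, b_{g+1}, \dots, b_{G'}$. Let $\Pi_B \in \mathbb{R}^{p \times p}$ denote the projection on $\langle b_2 - b_1, \dots, b_{G'} - b_1 \rangle^\perp$. These geometric definitions are meant with respect to the scalar product $\langle x, y \rangle_{\hat\Sigma} = x^t \hat\Sigma y$.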


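5.2 Proof of Theorem 2


Proof. The proof is based on the delta method. As $\hat B \to B^0$ and $\hat\Sigma \to \Sigma^0$, by Lemma 2, $A(B^0, \Sigma^0) = A(\hat B, \hat\Sigma)$ up to an asymptotically negligible set. Hence $M_{\Sigma^0}(B^0) = M_{\Sigma^0}(B^0_{A(B^0,\Sigma^0)})$ and $M_{\hat\Sigma}(\hat B) = M_{\hat\Sigma}(\hat B_{A(B^0,\Sigma^0)})$ up to an asymptotically negligible set. So without loss of generality let us assume (without changing the definition of $\hat\Sigma$) that $A(B^0, \Sigma^0) = A(\hat B, \hat\Sigma) = \{1,\dots,G\}$, and hence $B^0 = B^0_{A(B^0,\Sigma^0)}$ and $\hat B = \hat B_{A(\hat B,\hat\Sigma)}$. By Lemma 1 and Lemma 3, $M_\Sigma(B)$ is continuously differentiable in a neighborhood of $B^0$ and $\Sigma^0$. Using a Taylor expansion in a neighborhood of $B^0$ and $\Sigma^0$ we can write

\[
\begin{aligned}
\sqrt{n}\big(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0)\big)
&= D_B M_\eta(\xi)\, \sqrt{n}(\hat B - B^0) + D_\Sigma M_\eta(\xi)\, \sqrt{n}(\hat\Sigma - \Sigma^0) + o_p(1) \\
&= \big(D_B M_\eta(\xi) - D_B M_{\Sigma^0}(B^0)\big)\, \sqrt{n}(\hat B - B^0) \\
&\quad + \big(D_\Sigma M_\eta(\xi) - D_\Sigma M_{\Sigma^0}(B^0)\big)\, \sqrt{n}(\hat\Sigma - \Sigma^0) \\
&\quad + D_B M_{\Sigma^0}(B^0)\, \sqrt{n}(\hat B - B^0) \\
&\quad + D_\Sigma M_{\Sigma^0}(B^0)\, \sqrt{n}(\hat\Sigma - \Sigma^0) + o_p(1),
\end{aligned}
\]

with $\xi = \gamma B^0 + (1-\gamma)\hat B$ and $\eta = \gamma \Sigma^0 + (1-\gamma)\hat\Sigma$ for some random variable $\gamma \in [0,1]$. We now want to show that the first and second term are negligible, and calculate the asymptotic Gaussian distributions of the last two terms. Furthermore we want to show that the last two terms are asymptotically independent. This guarantees that the variance of $\lim_n \sqrt{n}(M_{\hat\Sigma}(\hat B) - M_{\Sigma^0}(B^0))$ is the sum of the variances of the two asymptotic Gaussian distributions.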

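Hence, to prove (4) it suffices to show:

(1) $D_B M_\eta(\xi) - D_B M_{\Sigma^0}(B^0) = o_p(1)$

(2) $D_\Sigma M_\eta(\xi) - D_\Sigma M_{\Sigma^0}(B^0) = o_p(1)$

(3) $\sqrt{n}(\hat b_g - b_g^0) \to \mathcal{N}(0, \sigma^2 (\Sigma^0)^{-1})$ for $g = 1,\dots,G$.

(4) $D_B M_{\Sigma^0}(B^0)\, \sqrt{n}(\hat B - B^0) \to \mathcal{N}\big(0,\ \sigma^2 \sum_{g \in A(B^0,\Sigma^0)} D_g M_{\Sigma^0}(B^0)\, (\Sigma^0)^{-1}\, D_g M_{\Sigma^0}(B^0)^t \big)$

(5) $D_\Sigma M_{\Sigma^0}(B^0)\, \sqrt{n}(\hat\Sigma - \Sigma^0) \to \mathcal{N}(0, V(B^0, \Sigma^0))$

(6) For $\delta_n := \sqrt{n}(\hat B - B^0)$ and $\Delta_n := \sqrt{n}(\hat\Sigma - \Sigma^0)$ we have $(\delta_n, \Delta_n) \to (\delta, \Delta)$ with $\delta_g$, $g = 1,\dots,G$, and $\Delta$ independent.

Part (1) and (2): By Lemma 1 and Lemma 3 the derivatives are continuous at $B^0$ and $\Sigma^0$, and $\hat\Sigma \to \Sigma^0$, $\hat B \to B^0$ in probability (which implies $\xi \to B^0$ and $\eta \to \Sigma^0$ in probability).

Part (3): This is immediate, as under the chosen model, conditioned on $X$,

\[ \hat b_g \sim \mathcal{N}\big(b_g^0,\ \sigma^2 (X_g^t X_g)^{-1}\big) \]

and $\frac{1}{n} X_g^t X_g \to \Sigma^0$ in probability.

Part (4): Part (3) and a linear transformation.

Part (5): We defer this part to Lemma 5.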

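Part (6): We saw the convergence of $\delta_n$ in part (3). The convergence of $\Delta_n$ is deferred to a separate lemma. In the following we use the notation $\delta = (\delta_1, \dots, \delta_G)$ and $\delta_n = (\delta_{n,1}, \dots, \delta_{n,G})$. For the asymptotic independence of part (6), we have to show that for any bounded continuous function $g$,

\[ E\, g(\delta_n, \Delta_n) \to \int\!\!\int g(\delta, \Delta)\, \frac{(\det \Sigma^0)^{G/2}}{(2\pi\sigma^2)^{pG/2}} \prod_{g=1}^G \exp\Big( -\frac{\delta_g^t \Sigma^0 \delta_g}{2\sigma^2} \Big)\, d\delta_1 \cdots d\delta_G\, P[d\Delta]. \]

In the following equation the inner integral is bounded by 2, and for $n \to \infty$, $\frac{1}{n} X_g^t X_g \to \Sigma^0$ in probability. Hence, by dominated convergence on the inner and outer integral,

\[ \int\!\!\int \bigg| \prod_{g=1}^G \frac{\sqrt{\det \tfrac{1}{n} X_g^t X_g}}{(2\pi\sigma^2)^{p/2}} \exp\Big( -\frac{\delta_{n,g}^t X_g^t X_g\, \delta_{n,g}}{2 n \sigma^2} \Big) - \prod_{g=1}^G \frac{\sqrt{\det \Sigma^0}}{(2\pi\sigma^2)^{p/2}} \exp\Big( -\frac{\delta_{n,g}^t \Sigma^0 \delta_{n,g}}{2\sigma^2} \Big) \bigg|\, d\delta_{n,1} \cdots d\delta_{n,G}\, P[d\Delta_n] \to 0. \]

Using this,

\[ \limsup_{n \to \infty} \big| E\, g(\delta_n, \Delta_n) - E\, g(\delta, \Delta_n) \big| = 0, \]

where $\delta$ is independent of $\Delta_n$, $\delta_g \sim \mathcal{N}(0, \sigma^2 (\Sigma^0)^{-1})$ i.i.d. Finally, with $\Delta$ independent of $\delta$, $\Delta \sim \lim_n \sqrt{n}(\hat\Sigma - \Sigma^0)$,

\[
\begin{aligned}
\limsup_{n \to \infty} \big| E\, g(\delta_n, \Delta_n) - E\, g(\delta, \Delta) \big|
&= \limsup_{n \to \infty} \big| E\, g(\delta, \Delta_n) - E\, g(\delta, \Delta) \big| \\
&= \limsup_{n \to \infty} \bigg| \int E\big[ g(\delta, \Delta_n) - g(\delta, \Delta) \,\big|\, \delta \big] \prod_{g=1}^G \frac{\sqrt{\det \Sigma^0}}{(2\pi\sigma^2)^{p/2}} \exp\Big( -\frac{\delta_g^t \Sigma^0 \delta_g}{2\sigma^2} \Big)\, d\delta_1 \cdots d\delta_G \bigg| \\
&= 0.
\end{aligned}
\]

In the second line we used equation (5.2), in the last line we used dominated convergence and $\Delta_n \to \Delta$. This concludes the proof. □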


Let $\Sigma \in \mathbb{R}^{p \times p}$ be symmetric positive definite. In the following, we work in the Hilbert space $(\mathbb{R}^p, \langle \cdot, \cdot \rangle_\Sigma)$, where for $x, y \in \mathbb{R}^p$,

\[ \langle x, y \rangle_\Sigma := x^t \Sigma y, \]

and induced norm

\[ \|x\|_\Sigma = \sqrt{x^t \Sigma x}. \]

This means that projections, orthogonality etc. are always meant with respect to this space. Let $PA$ denote the affine projection on the smallest affine space containing $b_1, \dots, b_G$. Let $PA^{(g)}$ denote the affine projection on the smallest affine space containing $b_1, \dots, b_{g-1}, b_{g+1}, \dots, b_G$. Note that for $g = 1$ this space can be expressed as $b_2 + \langle b_3 - b_2, \dots, b_G - b_2 \rangle$. Let $\Pi_B \in \mathbb{R}^{p \times p}$ denote the projection on $\langle b_2 - b_1, \dots, b_G - b_1 \rangle^\perp$.
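These geometric objects are straightforward to compute. The sketch below is an illustration under assumptions (the helper names `inner`, `linear_projector` and `affine_projection` are hypothetical): it implements the $\Sigma$-inner product and the affine projection onto the smallest affine space containing a set of points, with orthogonality understood in the $\Sigma$-geometry rather than the Euclidean one.

```python
import numpy as np

def inner(x, y, Sigma):
    """Sigma-inner product <x, y>_Sigma = x^T Sigma y."""
    return float(x @ Sigma @ y)

def linear_projector(D, Sigma):
    """Sigma-orthogonal projector onto the column span of D (shape p x k)."""
    p = Sigma.shape[0]
    if D.shape[1] == 0:                      # span of no vectors is {0}
        return np.zeros((p, p))
    return D @ np.linalg.solve(D.T @ Sigma @ D, D.T @ Sigma)

def affine_projection(x, points, Sigma):
    """Project x onto the smallest affine space containing the columns of `points`."""
    c = points[:, 0]                         # base point of the affine space
    D = points[:, 1:] - c[:, None]           # directions spanning the affine space
    return c + linear_projector(D, Sigma) @ (x - c)

# Project the origin onto the line through (1, 0) and (0, 1) with Sigma = diag(1, 4).
Sigma = np.diag([1.0, 4.0])
points = np.array([[1.0, 0.0], [0.0, 1.0]])  # columns are the two points
x_proj = affine_projection(np.zeros(2), points, Sigma)
residual_angle = inner(np.zeros(2) - x_proj, points[:, 1] - points[:, 0], Sigma)
```

In this example the projection is $(0.8, 0.2)$ rather than the Euclidean midpoint, because deviations in the second coordinate are weighted four times as heavily; the residual is $\Sigma$-orthogonal to the direction of the line.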


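Lemma 1. If $M_\Sigma(B) = \alpha_1 b_1 + \dots + \alpha_G b_G$ with $0 < \alpha_g < 1$ for $g = 1,\dots,G$, $G > 1$, and this representation is unique (i.e. $B = (b_1, \dots, b_G)$ has full rank), then $M_\Sigma$ is continuously differentiable in a neighborhood of $B$ with

\[ D_{g,v} M_\Sigma(B) = -\|M_\Sigma(B)\|_\Sigma \Big\langle \frac{M_\Sigma(B)}{\|M_\Sigma(B)\|_\Sigma}, v \Big\rangle_\Sigma \frac{(\mathrm{Id} - PA^{(g)})\, b_g}{\|(\mathrm{Id} - PA^{(g)})\, b_g\|_\Sigma^2} + \frac{\|(\mathrm{Id} - PA^{(g)})\, M_\Sigma(B)\|_\Sigma}{\|(\mathrm{Id} - PA^{(g)})\, b_g\|_\Sigma}\, \Pi_B v. \tag{5} \]

Here, $D_{g,v}$ denotes the differential with respect to the variable $b_g$ in direction $v$.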
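A numerical sanity check of the derivative formula is easy to set up for $G = 2$, where magging has a closed form (the minimum-norm point on the segment between the two vectors). The sketch below assumes the formula reads $D_{g,v} M_\Sigma(B) = -\langle M_\Sigma(B), v\rangle_\Sigma\, w / \|w\|_\Sigma^2 + (\|(\mathrm{Id} - PA^{(g)}) M_\Sigma(B)\|_\Sigma / \|w\|_\Sigma)\, \Pi_B v$ with $w = (\mathrm{Id} - PA^{(g)}) b_g$, and compares it with a finite difference. All function names and numerical inputs are hypothetical.

```python
import numpy as np

def proj(D, Sigma):
    """Sigma-orthogonal projector onto the column span of D (shape p x k)."""
    p = Sigma.shape[0]
    if D.shape[1] == 0:
        return np.zeros((p, p))
    return D @ np.linalg.solve(D.T @ Sigma @ D, D.T @ Sigma)

def magging_two_groups(b1, b2, Sigma):
    """Closed-form magging for G = 2: minimum Sigma-norm point on the segment [b1, b2]."""
    d = b1 - b2
    s = np.clip(-(b2 @ Sigma @ d) / (d @ Sigma @ d), 0.0, 1.0)
    return s * b1 + (1.0 - s) * b2

def derivative(B, g, v, Sigma, M):
    """Directional derivative of magging w.r.t. column g in direction v (assumed formula)."""
    others = np.delete(B, g, axis=1)
    c = others[:, 0]
    P_lin = proj(others[:, 1:] - c[:, None], Sigma)
    resid = lambda x: (x - c) - P_lin @ (x - c)       # (Id - PA^{(g)}) x
    nrm = lambda x: np.sqrt(x @ Sigma @ x)            # Sigma-norm
    w, m = resid(B[:, g]), resid(M)
    D_all = np.delete(B, 0, axis=1) - B[:, [0]]       # b_2 - b_1, ..., b_G - b_1
    Pi_v = v - proj(D_all, Sigma) @ v                 # projection on the orthogonal complement
    return -(M @ Sigma @ v) * w / nrm(w) ** 2 + nrm(m) / nrm(w) * Pi_v

Sigma = np.diag([1.0, 4.0])
B = np.eye(2)                                         # b_1 = (1, 0), b_2 = (0, 1)
v = np.array([1.0, 0.0])
M = magging_two_groups(B[:, 0], B[:, 1], Sigma)       # equals (0.8, 0.2)
analytic = derivative(B, 0, v, Sigma, M)

h = 1e-6                                              # finite-difference step
numeric = (magging_two_groups(B[:, 0] + h * v, B[:, 1], Sigma) - M) / h
```

For this configuration a direct calculation gives the derivative $(0.48, 0.32)$, which both the formula and the finite difference reproduce.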

Remark 1. In the proof of Theorem 2 we could assume that without loss of generality $\{1,\dots,G\} = A(B, \Sigma)$, i.e. $B = B_{A(B,\Sigma)}$. We saw that in a neighborhood of $B$ and $\Sigma$, magging depends only on $B_{A(B,\Sigma)}$. Hence, for using the formula of $D_g M_\Sigma(B)$ in the context of Theorems 1 and 2, replace in the definition $B$ by $B_{A(B,\Sigma)}$. The derivatives with respect to $b_g$, $g \in \{1,\dots,G\} \setminus A(B, \Sigma)$, are zero.


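Proof. Without loss of generality, let us assume that $g = 1$. We will show that the partial derivatives exist and are continuous.

Let $\Delta_1 \in \langle b_2 - b_1, \dots, b_G - b_1 \rangle^\perp$ and $\Delta_2 \in \langle b_2 - b_1, \dots, b_G - b_1 \rangle$ and define $\tilde B := (\tilde b_1, b_2, \dots, b_G)$ with $\tilde b_1 := b_1 + \Delta_1 + \Delta_2$. First, we want to show that, if $\|\Delta_1 + \Delta_2\|_\Sigma$ is small,

\[ M_\Sigma(\tilde B) = PA^{(1)} M_\Sigma(B) - \frac{\langle PA^{(1)} M_\Sigma(B), (\mathrm{Id} - PA^{(1)})\, \tilde b_1 \rangle_\Sigma}{\|(\mathrm{Id} - PA^{(1)})\, \tilde b_1\|_\Sigma^2}\, (\mathrm{Id} - PA^{(1)})\, \tilde b_1. \tag{6} \]

Let us denote the r.h.s. by $\xi(\tilde B)$. We have to show:

1. $\xi(\tilde B) \perp (\mathrm{Id} - PA^{(1)})\, \tilde b_1$

2. $\xi(\tilde B) \perp \langle b_3 - b_2, \dots, b_G - b_2 \rangle$

3. $\xi(\tilde B) \in \mathrm{CVX}(\tilde B)$, the convex hull generated by the columns of $\tilde B$.

Note that 1. and 2. guarantee that the r.h.s. in (6) is perpendicular to the affine space generated by the columns of $\tilde B$.

1. is trivial. 2. By definition, $(\mathrm{Id} - PA^{(1)})\, \tilde b_1 \perp \langle b_3 - b_2, \dots, b_G - b_2 \rangle$. Moreover, $PA^{(1)} M_\Sigma(B) \perp \langle b_3 - b_2, \dots, b_G - b_2 \rangle$, as we can decompose it into $PA^{(1)} M_\Sigma(B) = M_\Sigma(B) - (\mathrm{Id} - PA^{(1)}) M_\Sigma(B)$, and both summands are, by definition, perpendicular to $\langle b_3 - b_2, \dots, b_G - b_2 \rangle$.

Now let us show 3.: $M_\Sigma(B) = \sum_{g=1}^G \alpha_g b_g$ for some $0 < \alpha_g$ with $\sum_{g=1}^G \alpha_g = 1$, i.e. $(B^t B)^{-1} B^t M_\Sigma(B) = \alpha$. Similarly, as $\xi(\tilde B)$ lies on the affine space generated by $\tilde b_1, b_2, \dots, b_G$, we have $\xi(\tilde B) = \sum_{g=1}^G \tilde\alpha_g \tilde b_g$ with $\sum_{g=1}^G \tilde\alpha_g = 1$. For small $\|\Delta_1 + \Delta_2\|_\Sigma$, $\tilde B$ has full rank and, as $\xi(\tilde B) \to M_\Sigma(B)$,

\[ (\tilde B^t \tilde B)^{-1} \tilde B^t\, \xi(\tilde B) = \tilde\alpha \to \alpha. \]

Hence, for small $\|\Delta_1 + \Delta_2\|_\Sigma$, $\tilde\alpha_g > 0$ and $\sum_{g=1}^G \tilde\alpha_g = 1$, hence $\xi(\tilde B) \in \mathrm{CVX}(\tilde B)$ and thus $M_\Sigma(\tilde B) = \xi(\tilde B)$. This concludes the proof of (6).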

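Note that, as $\Delta_1 \perp \langle b_2 - b_1, \dots, b_G - b_1 \rangle = \langle b_1 - b_2, b_3 - b_2, \dots, b_G - b_2 \rangle$,

\[
\begin{aligned}
(\mathrm{Id} - PA^{(1)})\, \tilde b_1
&= \tilde b_1 - \arg\min_{\gamma \in b_2 + \langle b_3 - b_2, \dots, b_G - b_2 \rangle} \|\gamma - b_1 - \Delta_1 - \Delta_2\|_\Sigma^2 \\
&= \tilde b_1 - \arg\min_{\gamma \in b_2 + \langle b_3 - b_2, \dots, b_G - b_2 \rangle} \big( \|\gamma - b_1 - \Delta_2\|_\Sigma^2 + \|\Delta_1\|_\Sigma^2 \big) \\
&= \Delta_1 + (\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2).
\end{aligned}
\tag{7}
\]

$(\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2)$ and $(\mathrm{Id} - PA^{(1)})\, b_1$ are linearly dependent. To see this, observe that both lie in the one-dimensional space $\langle b_2 - b_1, \dots, b_G - b_1 \rangle \cap \langle b_3 - b_2, \dots, b_G - b_2 \rangle^\perp$. This implies that

\[
\frac{\langle PA^{(1)} M_\Sigma(B), (\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2) \rangle_\Sigma}{\|(\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2)\|_\Sigma^2}\, (\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2)
= \frac{\langle PA^{(1)} M_\Sigma(B), (\mathrm{Id} - PA^{(1)})\, b_1 \rangle_\Sigma}{\|(\mathrm{Id} - PA^{(1)})\, b_1\|_\Sigma^2}\, (\mathrm{Id} - PA^{(1)})\, b_1.
\tag{8}
\]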
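Now we can put these pieces together: In the first step we use (6) and (7), in the second we use $\Delta_1 \in \langle b_2 - b_1, \dots, b_G - b_1 \rangle^\perp$. Writing $w_2 := (\mathrm{Id} - PA^{(1)})(b_1 + \Delta_2)$ and $w := (\mathrm{Id} - PA^{(1)})\, b_1$ for brevity,

\[
\begin{aligned}
M_\Sigma(\tilde B)
&= PA^{(1)} M_\Sigma(B) - \frac{\langle PA^{(1)} M_\Sigma(B), \Delta_1 + w_2 \rangle_\Sigma}{\|\Delta_1 + w_2\|_\Sigma^2}\, (\Delta_1 + w_2) \\
&= PA^{(1)} M_\Sigma(B) - \frac{\langle PA^{(1)} M_\Sigma(B), \Delta_1 + w_2 \rangle_\Sigma}{\|\Delta_1\|_\Sigma^2 + \|w_2\|_\Sigma^2}\, (\Delta_1 + w_2).
\end{aligned}
\]

In the first step we do an expansion of the equation above and in the second, we use (8) and $w_2 = w + O(\|\Delta_2\|_\Sigma)$:

\[
\begin{aligned}
M_\Sigma(\tilde B)
&= PA^{(1)} M_\Sigma(B) - \frac{\langle PA^{(1)} M_\Sigma(B), w_2 \rangle_\Sigma}{\|w_2\|_\Sigma^2}\, w_2 - \frac{\langle PA^{(1)} M_\Sigma(B), \Delta_1 \rangle_\Sigma}{\|w_2\|_\Sigma^2}\, w_2 \\
&\quad - \frac{\langle PA^{(1)} M_\Sigma(B), w_2 \rangle_\Sigma}{\|w_2\|_\Sigma^2}\, \Delta_1 + O(\|\Delta_1\|_\Sigma^2 + \|\Delta_2\|_\Sigma^2) \\
&= PA^{(1)} M_\Sigma(B) - \frac{\langle PA^{(1)} M_\Sigma(B), w \rangle_\Sigma}{\|w\|_\Sigma^2}\, w - \frac{\langle PA^{(1)} M_\Sigma(B), \Delta_1 \rangle_\Sigma}{\|w\|_\Sigma^2}\, w \\
&\quad - \frac{\langle PA^{(1)} M_\Sigma(B), w \rangle_\Sigma}{\|w\|_\Sigma^2}\, \Delta_1 + O(\|\Delta_1\|_\Sigma^2 + \|\Delta_2\|_\Sigma^2).
\end{aligned}
\]

From this and (6) we obtain

\[
M_\Sigma(\tilde B) - M_\Sigma(B) = -\frac{\langle PA^{(1)} M_\Sigma(B), \Delta_1 \rangle_\Sigma}{\|w\|_\Sigma^2}\, w - \frac{\langle PA^{(1)} M_\Sigma(B), w \rangle_\Sigma}{\|w\|_\Sigma^2}\, \Delta_1 + O(\|\Delta_1\|_\Sigma^2 + \|\Delta_2\|_\Sigma^2).
\]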

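Now let us write $\Delta_1 + \Delta_2 = \gamma u$, $\Delta_1 = \gamma\bigl(M_\Sigma(B)/\|M_\Sigma(B)\|_\Sigma + v\bigr)$ with $v \perp_\Sigma M_\Sigma(B)$ and
$v \perp_\Sigma \langle b_2 - b_1, \dots, b_G - b_1\rangle$. By noting that

\[
\langle P_{A^{(1)}}M_\Sigma(B),\, \Delta_1\rangle_\Sigma
= \Bigl\langle M_\Sigma(B) + (P_{A^{(1)}} - \mathrm{Id})M_\Sigma(B),\; \gamma\Bigl(\frac{M_\Sigma(B)}{\|M_\Sigma(B)\|_\Sigma} + v\Bigr)\Bigr\rangle_\Sigma
= \gamma\,\|M_\Sigma(B)\|_\Sigma
\]

and, as $(\mathrm{Id} - P_{A^{(1)}})M_\Sigma(B)$ and $(\mathrm{Id} - P_{A^{(1)}})b_1$ are linearly dependent (both lie in the one-dimensional
space $\langle b_2 - b_1, \dots, b_G - b_1\rangle \cap \langle b_3 - b_2, \dots, b_G - b_2\rangle^{\perp}$),

\[
-\langle P_{A^{(1)}}M_\Sigma(B),\, (\mathrm{Id} - P_{A^{(1)}})b_1\rangle_\Sigma
= \langle (\mathrm{Id} - P_{A^{(1)}})M_\Sigma(B),\, (\mathrm{Id} - P_{A^{(1)}})b_1\rangle_\Sigma
= \|(\mathrm{Id} - P_{A^{(1)}})M_\Sigma(B)\|_\Sigma\,\|(\mathrm{Id} - P_{A^{(1)}})b_1\|_\Sigma.
\]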
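We obtain:

\[
M_\Sigma(\tilde B) - M_\Sigma(B)
= -\gamma\,\frac{\|M_\Sigma(B)\|_\Sigma}{\|(\mathrm{Id} - P_{A^{(1)}})b_1\|_\Sigma^2}\,(\mathrm{Id} - P_{A^{(1)}})b_1
+ \gamma\,\frac{\|(\mathrm{Id} - P_{A^{(1)}})M_\Sigma(B)\|_\Sigma\,\|(\mathrm{Id} - P_{A^{(1)}})b_1\|_\Sigma}{\|(\mathrm{Id} - P_{A^{(1)}})b_1\|_\Sigma^2}\Bigl(\frac{M_\Sigma(B)}{\|M_\Sigma(B)\|_\Sigma} + v\Bigr)
+ O(\|\Delta_1\|_\Sigma^2 + \|\Delta_2\|_\Sigma^2).
\]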


Hence the directional derivative exists and is equal to ([^. The assertion follows by existence 
and continuity of the directional derivatives in a neighborhood of $B$. □

Lemma 2. Let be positive definite. is continuous in B and Yi in a neighborhood of 

. Furthermore, under the assumptions of Theorem^ (or Theorem^, in a neighborhood of 
B^ and A{B,Y) is constant. 

Proof. First, let us prove that magging is continuous. Proof by contradiction: Assume there
exist sequences $B_k \to B$, $\Sigma_k \to \Sigma$ positive definite such that $M_{\Sigma_k}(B_k) \not\to M_\Sigma(B)$. Without loss
of generality, as $\Sigma$ is invertible, $M_{\Sigma_k}(B_k)$ converges, too. By definition of $M_{\Sigma_k}(B_k)$ we have

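where $\Pi_{B_k}$ denotes the projection (in $\langle\cdot,\cdot\rangle$) on the convex set $\mathrm{CVX}(B_k)$. By continuity,

\[
\bigl\|\lim_k M_{\Sigma_k}(B_k)\bigr\|_\Sigma \le \|M_\Sigma(B)\|_\Sigma.
\]

We have $M_{\Sigma_k}(B_k) \in \mathrm{CVX}(B_k)$ and hence by continuity $\lim_k M_{\Sigma_k}(B_k) \in \mathrm{CVX}(B)$. As magging
is unique ($\Sigma$ is positive definite), this yields a contradiction.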

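Consider $b_g^0$ with $g \in A(B^0,\Sigma^0)$. By the assumptions of the theorem, $M_{\Sigma^0}(B^0) = \sum_{g \in A(B^0,\Sigma^0)} \alpha_g b_g^0$
with $0 < \alpha_g < 1$. Hence for small $\gamma \in \mathbb{R}$, $(1 - \gamma)M_{\Sigma^0}(B^0) + \gamma b_g^0 \in \mathrm{CVX}(B^0)$ and by definition
of magging

\[
\|M_{\Sigma^0}(B^0)\|_{\Sigma^0} \le \|(1 - \gamma)M_{\Sigma^0}(B^0) + \gamma b_g^0\|_{\Sigma^0}. \tag{9}
\]

Using this inequality for small $\gamma > 0$ and small $\gamma < 0$ we obtain $\langle M_{\Sigma^0}(B^0),\, b_g^0 - M_{\Sigma^0}(B^0)\rangle_{\Sigma^0} = 0$.
Hence, for all $g \in A(B^0,\Sigma^0)$, $M_{\Sigma^0}(B^0)$ is perpendicular (with respect to $\langle\cdot,\cdot\rangle_{\Sigma^0}$) to $b_g^0 - M_{\Sigma^0}(B^0)$.
Hence $b_g^0 \in M_{\Sigma^0}(B^0) + M_{\Sigma^0}(B^0)^{\perp}$ for all $g \in A(B^0,\Sigma^0)$.

Furthermore, by the assumptions of the theorem, if $g \notin A(B^0,\Sigma^0)$ we have $b_g^0 \notin M_{\Sigma^0}(B^0) + M_{\Sigma^0}(B^0)^{\perp}$.
By continuity, for $B = (b_1, \dots, b_G)$ close to $B^0$ and $\Sigma$ close to $\Sigma^0$ (in $\|\cdot\|_2$) we have
$b_g \notin M_\Sigma(B) + M_\Sigma(B)^{\perp}$. By an analogous argument as in equation (9), $g \notin A(B,\Sigma)$. This
proves $A(B^0,\Sigma^0) \supset A(B,\Sigma)$.

It remains to show $A(B,\Sigma) \supset A(B^0,\Sigma^0)$: For notational simplicity let us assume $A(B^0,\Sigma^0) = \{1,\dots,G\}$.
For $B$ close to $B^0$ and $\Sigma$ close to $\Sigma^0$, $M_\Sigma(B) = B\tilde\alpha$ with $\sum_{g=1}^G \tilde\alpha_g = 1$.
We want to show that for $B$ close to $B^0$ and $\Sigma$ close to $\Sigma^0$ (in $\|\cdot\|_2$), $0 < \tilde\alpha_g < 1$.
To this end, note that by the assumptions of the theorem we have that $B^0_{A(B^0,\Sigma^0)}$ (here
without loss of generality: $B^0$) has full rank, hence for $B$ close to $B^0$ and $\Sigma$ close to $\Sigma^0$ we have

\[
(B^tB)^{-1}B^tM_\Sigma(B) = \tilde\alpha \quad\text{with}\quad \tilde\alpha_g \ge 0,\ \sum_{g=1}^G \tilde\alpha_g = 1.
\]

Furthermore,

\[
\lim_{B \to B^0,\ \Sigma \to \Sigma^0} (B^tB)^{-1}B^tM_\Sigma(B) = ((B^0)^tB^0)^{-1}(B^0)^tM_{\Sigma^0}(B^0) = \alpha.
\]

Hence for $B$ close to $B^0$ and $\Sigma$ close to $\Sigma^0$ (in $\|\cdot\|_2$), $0 < \tilde\alpha_g < 1$. This concludes the proof. □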
Lemma 3. Let $G \ge 2$. Let $M_\Sigma(B) = \alpha_1 b_1 + \dots + \alpha_G b_G$ with unique $0 < \alpha_g < 1$ satisfying
$\sum_{g=1}^G \alpha_g = 1$. Then the mapping

\[
\{\text{positive definite matrices in } \mathbb{R}^{p \times p}\} \to \mathbb{R}^p, \qquad \Sigma \mapsto M_\Sigma(B)
\]

is continuously differentiable at $B$, $\Sigma$. Let $\Delta$ be a symmetric matrix. The differential in direction
$\Delta$ is

\[
D_\Sigma M_\Sigma(B)\Delta = -D(D^t\Sigma D)^{-1}D^t\Delta M_\Sigma(B),
\]

where

\[
D := (b_2, \dots, b_G) - (b_1, \dots, b_1).
\]

Proof. By elementary analysis, it suffices to show that the directional derivatives exist in a 
neighborhood and that they are continuous. 

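For a small symmetric perturbation $\lambda\Delta$, by continuity of magging (Lemma 2), $M_{\Sigma+\lambda\Delta}(B)$
has to satisfy

\[
M_{\Sigma+\lambda\Delta}(B) = M_\Sigma(B) + D\gamma
\]

for some (small) vector $\gamma \in \mathbb{R}^{G-1}$. By definition of magging, and as $0 < \alpha_g < 1$, we have
$\|M_{\Sigma+\lambda\Delta}(B)\|_{\Sigma+\lambda\Delta} \le \|M_{\Sigma+\lambda\Delta}(B) + D\gamma'\|_{\Sigma+\lambda\Delta}$ for all small vectors $\gamma' \in \mathbb{R}^{G-1}$. Hence,

\[
M_{\Sigma+\lambda\Delta}(B)^t(\Sigma + \lambda\Delta)D = 0. \tag{10}
\]

Putting these two conditions together, we get

\[
(M_\Sigma(B) + D\gamma)^t(\Sigma + \lambda\Delta)D = 0.
\]

Furthermore, analogously as in equation (10) we obtain

\[
M_\Sigma(B)^t\Sigma D = 0.
\]

By combining the last two equations,

\[
\gamma^tD^t(\Sigma + \lambda\Delta)D = -M_\Sigma(B)^t\lambda\Delta D.
\]

As $D^t(\Sigma + \lambda\Delta)D$ is invertible ($D$ has full rank as $B$ has full rank; $B$ has full rank as the $\alpha_g$
are unique),

\[
\gamma^t = -M_\Sigma(B)^t\lambda\Delta D\,(D^t(\Sigma + \lambda\Delta)D)^{-1},
\qquad
D\gamma = -D(D^t(\Sigma + \lambda\Delta)D)^{-1}D^t\lambda\Delta M_\Sigma(B).
\]

Dividing by $\lambda$ and letting $\lambda \to 0$ gives the desired result. □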
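The differential in Lemma 3 can be checked numerically. The following is a minimal sketch, not part of the original argument. It assumes that all magging weights lie strictly in $(0,1)$, so that magging coincides with the $\Sigma$-orthogonal projection of the origin onto the affine hull of $b_1, \dots, b_G$. The function names `magging_interior` and `differential_in_sigma` and the random test data are illustrative assumptions.

```python
import numpy as np

def magging_interior(B, Sigma):
    """Sigma-projection of the origin onto the affine hull of the columns of B.

    Assumption: all magging weights lie strictly in (0, 1), so the simplex
    constraints are inactive and magging equals this affine projection.
    """
    b1 = B[:, [0]]
    D = B[:, 1:] - b1                       # D = (b_2, ..., b_G) - (b_1, ..., b_1)
    c = -np.linalg.solve(D.T @ Sigma @ D, D.T @ Sigma @ b1)
    return (b1 + D @ c).ravel()

def differential_in_sigma(B, Sigma, Delta):
    """Closed form -D (D^t Sigma D)^{-1} D^t Delta M_Sigma(B) from Lemma 3."""
    D = B[:, 1:] - B[:, [0]]
    M = magging_interior(B, Sigma)
    return -D @ np.linalg.solve(D.T @ Sigma @ D, D.T @ (Delta @ M))

rng = np.random.default_rng(0)
p, G = 5, 3
B = rng.normal(size=(p, G))
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)             # positive definite by construction
S = rng.normal(size=(p, p))
Delta = (S + S.T) / 2                       # symmetric perturbation direction

# Central finite difference in the direction Delta.
lam = 1e-5
fd = (magging_interior(B, Sigma + lam * Delta)
      - magging_interior(B, Sigma - lam * Delta)) / (2 * lam)
closed = differential_in_sigma(B, Sigma, Delta)
max_abs_error = np.max(np.abs(fd - closed))

# First-order condition M^t Sigma D = 0 used in the proof.
D = B[:, 1:] - B[:, [0]]
orthogonality = D.T @ Sigma @ magging_interior(B, Sigma)
```

The finite difference and the closed form should agree up to discretisation and rounding error, and `orthogonality` should vanish up to rounding, mirroring the condition $M_\Sigma(B)^t\Sigma D = 0$.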


Lemma 4. Let X^. ~ A, A: = 1, ...,nG denote the i.i.d. rows of IK. Let E[||Xj.Xi. H^] < 00 and 
= E[XiXi.] positive definite. Then, for n —>■ 00 , 


.. nG 

k=l 


where the symmetric matrix $\Delta$ has centered multivariate normal distributed entries on and
below the diagonal with covariance

\[
C_{ijkl} := \mathrm{Cov}(\Delta_{ij}, \Delta_{kl}) = \frac{1}{G}\,E\bigl[(X_{1i}X_{1j} - E[X_{1i}X_{1j}])(X_{1k}X_{1l} - E[X_{1k}X_{1l}])\bigr].
\]
Proof. Apply the CLT. □

In the following Lemma, we want to calculate the distribution of $D_\Sigma M_\Sigma(B)(\hat\Sigma - \Sigma)$.

Lemma 5. Let us use the setting of Lemma 4. Then

\[
\sqrt{n}\,D_\Sigma M_\Sigma(B)(\hat\Sigma - \Sigma) \rightharpoonup \mathcal{N}(0, V(B,\Sigma))
\]

with

\[
V(B,\Sigma) = D(D^t\Sigma D)^{-1}D^t\,C\,D(D^t\Sigma D)^{-1}D^t,
\]

where

\[
C_{ij} = \sum_{k,l=1}^{p} M_\Sigma(B)_k\,M_\Sigma(B)_l\,C_{iklj}
\]

is the covariance matrix of $\Delta M_\Sigma(B)$ and

\[
D := (b_2, \dots, b_G) - (b_1, \dots, b_1).
\]

Remark 2. In the proof of Theorem^ we could assume that without loss of generality {1,..., G} 
A{B,T,), i.e. B = i3A(_B,E)- F'or using the definition of V in the context of Theorem^ and^ 
replace in the definition B by '^be G in the definition of C stays the same, i.e. it is 

still the total number of groups. 

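Proof. With Lemma 3 and Lemma 4 it suffices to calculate the distribution of

\[
-D(D^t\Sigma D)^{-1}D^t\Delta M_\Sigma(B),
\]

i.e. the nontrivial part is to calculate the distribution of $\Delta M_\Sigma(B)$. We know it is Gaussian and
centered, hence it suffices to determine the covariance matrix:

\[
\begin{aligned}
E\bigl(\Delta M_\Sigma(B)M_\Sigma(B)^t\Delta\bigr)_{ij}
&= E\sum_{k,l=1}^{p} \Delta_{ik}\,\bigl(M_\Sigma(B)M_\Sigma(B)^t\bigr)_{kl}\,\Delta_{lj} \\
&= \sum_{k,l=1}^{p} M_\Sigma(B)_k\,M_\Sigma(B)_l\,E\,\Delta_{ik}\Delta_{lj} \\
&= \sum_{k,l=1}^{p} M_\Sigma(B)_k\,M_\Sigma(B)_l\,C_{iklj}.
\end{aligned}
\]

In the last line we used Lemma 4. This concludes the proof.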
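As an illustration of Lemma 5, the covariance $V(B,\Sigma)$ can be computed with a plug-in estimate. The sketch below is an assumption-laden example rather than the authors' implementation: it replaces $\Sigma$ and the fourth moments by empirical averages over the rows of a data matrix `X`, assumes all groups are active (weights strictly in $(0,1)$), and uses the hypothetical helper name `plug_in_V`. It exploits that $\sum_{k,l} M_k M_l (x_ix_k - \Sigma_{ik})(x_lx_j - \Sigma_{lj})$ is the $(i,j)$ entry of $w w^t$ with $w = (xx^t - \Sigma)M$.

```python
import numpy as np

def plug_in_V(X, B, G):
    """Empirical analogue of V(B, Sigma) = P C P^t with P = D (D^t Sigma D)^{-1} D^t.

    X : (N, p) array whose rows play the role of the i.i.d. rows in Lemma 4.
    B : (p, G) matrix of group-wise coefficients.
    Assumes every group is active, so magging is the affine-hull projection.
    """
    N, p = X.shape
    Sigma = X.T @ X / N
    b1 = B[:, [0]]
    D = B[:, 1:] - b1
    K = np.linalg.solve(D.T @ Sigma @ D, D.T)      # (D^t Sigma D)^{-1} D^t
    M = (b1 - D @ (K @ Sigma @ b1)).ravel()        # magging under the assumption
    # Row n of W is (x_n x_n^t - Sigma) M.
    W = X * (X @ M)[:, None] - Sigma @ M
    C = W.T @ W / (N * G)                          # includes the 1/G factor of C_ijkl
    P = D @ K
    return P @ C @ P.T, M

rng = np.random.default_rng(1)
p, G = 4, 3
X = rng.normal(size=(2000, p))
B = rng.normal(size=(p, G))
V, M = plug_in_V(X, B, G)
min_eigenvalue = np.linalg.eigvalsh((V + V.T) / 2).min()
rank_V = np.linalg.matrix_rank(V)
```

By construction `V` is symmetric and positive semidefinite, and its rank is at most $G - 1$ because it is sandwiched between copies of $D(D^t\Sigma D)^{-1}D^t$.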

5.3 Proof of Theorem [ 1 ] 

Proof. First, note that by Lemma[^ W (S^, B^) is invertible. Using Lemma[^ in a neighborhood 
of B^ and the set-valued function A{B, S) is constant. Hence, by Lemma]^and Lemmaj^ the 
derivatives of M^{B) = My,{B/^(^b,t.)) continuous at B^ and Furthermore, 
is continuous in C and in B and S at B^ and All toother, W{E,B) is continuous at B^ and 
in all its variables. By the definition of C in Lemmaand the definition of C in Section 
C ^C. 


□ 


5.1 
Hence, $W(\hat\Sigma, \hat B) \to W(\Sigma^0, B^0)$ in probability and we obtain that $W(\hat B, \hat\Sigma)^{-1} \to W(B^0, \Sigma^0)^{-1}$

in probability. By Theorem and Slutsky’s Theorem we obtain 

V^iM^iB) - Mj,oiB^)YW{B, t)-^^{M^{B) - Mj,,{B^)) - x\p) 

for n —>• oo. Hence 

P[M2o(H°) E C(S,.B)] 

=P[(M^(.B) - Mj^q(B^)YW(B, ±)-^(Mf.(B) - Mj^o(B^)) < -] 

2 . ^ n 

—— Oi 


for n ^ oo. This concludes the proof. 

□ 


5.4 Relaxation-based approach 

A simple approach is as follows: For given $\alpha > 0$, take random sets $\mathcal{R}_\Sigma$, $\mathcal{R}_B$ such that

\[
P[\Sigma^0 \in \mathcal{R}_\Sigma,\ B^0 \in \mathcal{R}_B] \ge 1 - \alpha,
\]

where $B^0 = (b_1^0, \dots, b_G^0)$ is the matrix of regression coefficients in all $G$ groups. A generic
approach is to choose a confidence region for $\Sigma^0$ on the confidence level $1 - \alpha/2$ and confidence
regions for $b_g^0$ on the confidence level $1 - \alpha/(2G)$. However, this approach can easily be improved
by taking larger regions around $b_g$ that are far away from zero (thus have negligible influence
on $M_\Sigma(B)$) and smaller regions around $b_g$ that are close to zero. Then calculate

\[
\mathcal{R}_M := \{M_\Sigma(B) : \Sigma \in \mathcal{R}_\Sigma,\ B \in \mathcal{R}_B\} \subset \mathbb{R}^p,
\]

which is a $1 - \alpha$ confidence region for the maximin effect. However, direct computation of this
confidence region is computationally cumbersome.

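For known $\Sigma^0$ the idea can be relaxed to the following scheme:

For $m \in \mathbb{R}^p$ and $\Sigma \in \mathbb{R}^{p \times p}$ positive definite let us define $\|m\|_\Sigma := \sqrt{m^t\Sigma m}$. Note that this
defines a norm on $\mathbb{R}^p$. Now, with all minima and suprema taken over $\gamma \ge 0$, $\sum_{g=1}^G \gamma_g = 1$,

\[
\begin{aligned}
\|M_{\Sigma^0}(B')\|_{\Sigma^0}
&= \min_\gamma \|B'\gamma\|_{\Sigma^0} \\
&= \min_\gamma \bigl(\|B'\gamma\|_{\Sigma^0} - \|B\gamma\|_{\Sigma^0} + \|B\gamma\|_{\Sigma^0}\bigr) \\
&\le \sup_\gamma \bigl|\|B'\gamma\|_{\Sigma^0} - \|B\gamma\|_{\Sigma^0}\bigr| + \min_\gamma \|B\gamma\|_{\Sigma^0} \\
&\le \sup_\gamma \|(B' - B)\gamma\|_{\Sigma^0} + \min_\gamma \|B\gamma\|_{\Sigma^0}
\end{aligned}
\]

and hence

\[
\begin{aligned}
\|M_{\Sigma^0}(B')\|_{\Sigma^0}
&\le \sup_\gamma \sum_{g=1}^{G} \gamma_g\,\|b_g' - b_g\|_{\Sigma^0} + \min_\gamma \|B\gamma\|_{\Sigma^0} \\
&= \max_{g=1,\dots,G} \|b_g' - b_g\|_{\Sigma^0} + \min_\gamma \|B\gamma\|_{\Sigma^0} \\
&= \max_{g=1,\dots,G} \|b_g' - b_g\|_{\Sigma^0} + \|M_{\Sigma^0}(B)\|_{\Sigma^0}.
\end{aligned}
\]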
By symmetry,

\[
\bigl|\|M_{\Sigma^0}(B')\|_{\Sigma^0} - \|M_{\Sigma^0}(B)\|_{\Sigma^0}\bigr| \le \max_{g=1,\dots,G} \|b_g' - b_g\|_{\Sigma^0}. \tag{11}
\]

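We can now choose a covering of the confidence region $\mathcal{R}_B$ with $B^{(k)}$, $k = 1, \dots, K$, such
that balls with radius $\epsilon_k$ around $B^{(k)}$ cover $\mathcal{R}_B$ with respect to the maximum norm
$\|B\|_{\max} := \max_g \|b_g\|_{\Sigma^0}$.

A confidence region of the maximin effect can then be constructed as

\[
\mathcal{R} = \bigcup_{k=1,\dots,K} \bigl\{M : \bigl|\|M\|_{\Sigma^0} - \|M_{\Sigma^0}(B^{(k)})\|_{\Sigma^0}\bigr| \le \epsilon_k\bigr\} \cap \mathrm{CVX}.
\]

This confidence region is valid: For all $M_{\Sigma^0}(B') \in \mathcal{R}_M$ there exists $k \in \{1, \dots, K\}$ such that
$\|B' - B^{(k)}\|_{\max} \le \epsilon_k$. By equation (11), $\bigl|\|M_{\Sigma^0}(B')\|_{\Sigma^0} - \|M_{\Sigma^0}(B^{(k)})\|_{\Sigma^0}\bigr| \le \epsilon_k$, hence $M_{\Sigma^0}(B') \in \mathcal{R}$.
This implies $\mathcal{R}_M \subset \mathcal{R}$, and

\[
P[M_{\Sigma^0}(B^0) \in \mathcal{R}] \ge P[M_{\Sigma^0}(B^0) \in \mathcal{R}_M] \ge P[B^0 \in \mathcal{R}_B] \ge 1 - \alpha.
\]

If $\Sigma^0$ is unknown, using the approach above we need to estimate lower and upper bounds for
$\|\cdot\|_{\Sigma^0}$.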




